By: Sareet Nayak
Police brutality is a public health crisis in the US. "Police brutality" describes the unwarranted and excessive force, whether verbal or physical, that police officers and other law enforcement use to injure and kill civilians. Black people are affected by police brutality at the highest rate. After the 2014 fatal shooting of Michael Brown at the hands of a police officer, the "Black Lives Matter" movement grew: a political and social movement to increase accountability for and awareness of police brutality and racially motivated violence against Black people. I do not stand for the unjust treatment of anyone, especially Black people in light of police brutality; I have taken part in protests and demonstrations in my own community and will continue to do so for as long as this problem persists.
While reading further about the "Black Lives Matter" movement, I found that the Washington Post has published data on many fatal police shootings in the US between 2015 and 2022. I say many, not all, because many incidents of police brutality go undocumented. The data include each victim's name, gender, age, race, threat level, and signs of mental illness, as well as the circumstances of the killing: the date, the manner of death, whether the victim was armed, fleeing, or threatening, the location, and whether the officer's body camera was on.
Through my involvement and research, I became interested in understanding the variables that affect police brutality, and from there in providing insights that policymakers will hopefully be able to act on.
This semester, I am taking a class called "CMSC 320: Introduction to Data Science," where we had the opportunity to learn how to tell stories with data through the industry-standard data science pipeline: 1) data collection, 2) data management and representation, 3) exploratory data analysis, 4) hypothesis testing, and 5) communication of the insights attained. In this project, I will walk you through the entire data science pipeline. Enjoy!
#Making imports to the relevant Python libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from datetime import datetime
import folium
from geopy.geocoders import Nominatim
import requests
from bs4 import BeautifulSoup
import pgeocode
import plotly.express as px
import math
from collections import defaultdict
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
I imported my data into a dataframe called “police_killings”. This process was fairly simple because all of the data was readily accessible from Kaggle (https://www.kaggle.com/datasets/kwullum/fatal-police-shootings-in-the-us).
#Reading information on all of the police killings and storing it in a dataframe called "police_killings"
police_killings = pd.read_csv('PoliceKillingsUS.csv', encoding='cp1252')
In the data management stage, I did data transformation and data cleaning. Data transformation restructures the data into a more usable format, while data cleaning removes unnecessary content and corrects inaccuracies. For cleaning, I removed the manner of death and age columns because they were not relevant to my study, and I dropped any entries with a missing value in the armed, race, or flee column, since those were the only columns with missing values and incomplete entries would not be useful to my study. For transformation, I converted the date column to a proper datetime format ('%m-%d-%Y') to make plotting easier; changed gender entries from "M" and "F" to "Male" and "Female" for readability; changed race entries from "A", "W", "H", "B", "N", and "O" to "Asian", "White", "Hispanic", "Black", "Native", and "Other" respectively; and renamed the columns to be more specific and readable. Additionally, I added a new column indicating whether the victim was armed, based on the value of the "Armed With" column.
#Removing columns which are irrelevant to my study
police_killings = police_killings.drop(['manner_of_death','age'], axis=1)
#Dropping any entries which have missing values in any of the columns: "armed", "race", and "flee"
police_killings = police_killings.dropna(subset=['armed', 'race', 'flee'])
#Iterating through the dataframe "police_killings" and changing the "date" column to be in a proper datetime format
fixed_dates = []
for index, row in police_killings.iterrows():
    date_str = ""
    date_str += row['date'].split('/')[1]
    date_str += '-'
    date_str += row['date'].split('/')[0]
    date_str += '-20'
    date_str += row['date'].split('/')[2]
    date_object = datetime.strptime(date_str, '%m-%d-%Y').date()
    fixed_dates.append(date_object)
police_killings['date'] = fixed_dates
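The loop above can also be condensed into one vectorized call. A minimal sketch on hypothetical values, assuming the raw dates follow the day/month/two-digit-year layout the loop implies:

```python
import pandas as pd

# Toy column standing in for the raw "police_killings" dates (hypothetical values)
df = pd.DataFrame({'date': ['02/01/15', '31/12/16']})

# One vectorized call replaces the manual string surgery; '%d/%m/%y'
# matches the day/month/two-digit-year layout the loop above assumes
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y')
```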
#Iterating through the dataframe "police_killings" and changing gender entries from "M" to "Male" and from "F" to "Female"
fixed_gender = []
for index, row in police_killings.iterrows():
    if (row['gender'] == 'M'):
        fixed_gender.append('Male')
    else:
        fixed_gender.append('Female')
police_killings['gender'] = fixed_gender
#Iterating through the dataframe "police_killings" and changing race entries from their one-letter codes (e.g. "A") to full names (e.g. "Asian")
fixed_race = []
for index, row in police_killings.iterrows():
    if (row['race'] == 'A'):
        fixed_race.append('Asian')
    elif (row['race'] == 'W'):
        fixed_race.append('White')
    elif (row['race'] == 'H'):
        fixed_race.append('Hispanic')
    elif (row['race'] == 'B'):
        fixed_race.append('Black')
    elif (row['race'] == 'N'):
        fixed_race.append('Native')
    else:
        fixed_race.append('Other')
police_killings['race'] = fixed_race
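Both recoding loops can also be written as dictionary lookups with `Series.map`. A minimal sketch on a hypothetical frame; the `fillna` mirrors the loop's else branch, since `map` leaves unmapped codes as NaN:

```python
import pandas as pd

# Lookup tables for the recodings
race_map = {'A': 'Asian', 'W': 'White', 'H': 'Hispanic',
            'B': 'Black', 'N': 'Native', 'O': 'Other'}
gender_map = {'M': 'Male', 'F': 'Female'}

# Toy frame standing in for "police_killings" (hypothetical values)
df = pd.DataFrame({'race': ['A', 'B', 'X'], 'gender': ['M', 'F', 'M']})

# map() replaces each code via the dict; fillna mimics the loop's else branch
df['race'] = df['race'].map(race_map).fillna('Other')
df['gender'] = df['gender'].map(gender_map)
```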
#Resetting the column names to make them more intuitive and reasonable
police_killings.columns = ['ID', 'Name', 'Date', 'Armed With', 'Gender', 'Race', 'City', 'State', 'Has Signs of Mental Illness','Threat Level', 'Fleeing', 'Has Body Camera']
After data cleaning, I was able to verify that I had no missing values. This was good, because I did not have to take extra steps (removal or imputation) to handle missing data. Removal is straightforward, while imputation replaces missing values with educated guesses such as the mean or median.
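I did not need imputation here, but for completeness, a minimal sketch of what it could look like on a hypothetical frame with gaps:

```python
import pandas as pd

# Toy frame with deliberate gaps (hypothetical values, not from my dataset)
df = pd.DataFrame({'age': [25.0, None, 40.0], 'race': ['White', None, 'Black']})

# Numeric column: impute with the median; categorical column: impute with the mode
df['age'] = df['age'].fillna(df['age'].median())
df['race'] = df['race'].fillna(df['race'].mode()[0])
```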
columns = police_killings.columns
print("Missing Value Count per Column")
#Printing the number of missing values (NaNs) per column
for column in columns:
    print(str(column) + " has " + str(police_killings[column].isnull().sum()) + " missing values.")
Missing Value Count per Column
ID has 0 missing values.
Name has 0 missing values.
Date has 0 missing values.
Armed With has 0 missing values.
Gender has 0 missing values.
Race has 0 missing values.
City has 0 missing values.
State has 0 missing values.
Has Signs of Mental Illness has 0 missing values.
Threat Level has 0 missing values.
Fleeing has 0 missing values.
Has Body Camera has 0 missing values.
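The same per-column counts can also come from a single vectorized call; a minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Toy frame standing in for "police_killings" (hypothetical values)
df = pd.DataFrame({'Name': ['Tim Elliot', None], 'State': ['WA', 'OR']})

# isnull() yields a boolean frame; summing collapses it to a per-column count
missing_per_column = df.isnull().sum()
```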
#Converting the well formatted "Date" column into a datetime object with the following format
police_killings['Date'] = pd.to_datetime(police_killings['Date'], format="%Y-%m-%d", utc=False)
armed = []
#Creating a new column for whether the victim was armed. Iterating through the dataframe "police_killings" and appending "No" when the "Armed With" column says "unarmed" and "Yes" otherwise
for index, row in police_killings.iterrows():
    if (row['Armed With'] == 'unarmed'):
        armed.append('No')
    else:
        armed.append('Yes')
police_killings['Armed'] = armed
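The armed/unarmed flag can also be derived without a loop, for example with `numpy.where`; a minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy column standing in for "Armed With" (hypothetical values)
df = pd.DataFrame({'Armed With': ['gun', 'unarmed', 'knife']})

# np.where evaluates the condition over the whole column at once
df['Armed'] = np.where(df['Armed With'] == 'unarmed', 'No', 'Yes')
```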
#Fixing the formatting of all the items within the "Threat Level" and "Fleeing" columns so that they match title case
police_killings["Threat Level"] = police_killings["Threat Level"].str.title()
police_killings["Fleeing"] = police_killings["Fleeing"].str.title()
#Displaying the "police_killings" dataframe (pandas shows the first and last few rows)
police_killings
| | ID | Name | Date | Armed With | Gender | Race | City | State | Has Signs of Mental Illness | Threat Level | Fleeing | Has Body Camera | Armed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Tim Elliot | 2015-01-02 | gun | Male | Asian | Shelton | WA | True | Attack | Not Fleeing | False | Yes |
| 1 | 4 | Lewis Lee Lembke | 2015-01-02 | gun | Male | White | Aloha | OR | False | Attack | Not Fleeing | False | Yes |
| 2 | 5 | John Paul Quintero | 2015-01-03 | unarmed | Male | Hispanic | Wichita | KS | False | Other | Not Fleeing | False | No |
| 3 | 8 | Matthew Hoffman | 2015-01-04 | toy weapon | Male | White | San Francisco | CA | True | Attack | Not Fleeing | False | Yes |
| 4 | 9 | Michael Rodriguez | 2015-01-04 | nail gun | Male | Hispanic | Evans | CO | False | Attack | Not Fleeing | False | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2523 | 2808 | Kesharn K. Burney | 2017-07-26 | vehicle | Male | Black | Dayton | OH | False | Attack | Car | False | Yes |
| 2525 | 2820 | Deltra Henderson | 2017-07-27 | gun | Male | Black | Homer | LA | False | Attack | Car | False | Yes |
| 2528 | 2812 | Alejandro Alvarado | 2017-07-27 | knife | Male | Hispanic | Chowchilla | CA | False | Attack | Not Fleeing | False | Yes |
| 2533 | 2817 | Isaiah Tucker | 2017-07-31 | vehicle | Male | Black | Oshkosh | WI | False | Attack | Car | True | Yes |
| 2534 | 2815 | Dwayne Jeune | 2017-07-31 | knife | Male | Black | Brooklyn | NY | True | Attack | Not Fleeing | False | Yes |
2282 rows × 13 columns
Through exploratory data analysis and visualization, I wanted to answer some questions: 1) How does the frequency of police killings change over time? 2) Is there a relationship between demographic data and the frequency of killings per state? 3) How does location affect the frequency of police killings?
#Creating a new column in the dataframe which mimics the date, but only contains the year and month
police_killings['year_month'] = police_killings['Date'].map(lambda dt: dt.strftime('%Y-%m'))
#Grouping the "police_killings" dataframe using the new "year_month" column that was defined
grouped_ym = police_killings.groupby(police_killings['year_month']).size().to_frame("count").reset_index()
#Plotting the monthly counts as a line chart; the size() aggregation above counted the number of entries (police killings) per year/month
grouped_ym.plot(kind = "line", x = 'year_month', y = "count", figsize=(8, 8))
plt.title('Frequency of Police Killings (per month, 2015-2017)')
plt.xlabel('Date')
plt.ylabel('Frequency of Police Killings')
There is a decreasing trend in the frequency of killings over time. However, there are some peaks, and I am interested in understanding what causes those deviations from the trend.
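One way to look past individual monthly peaks is to smooth the series with a rolling average; a minimal sketch on hypothetical monthly counts standing in for `grouped_ym`:

```python
import pandas as pd

# Hypothetical monthly counts (not the real data)
counts = pd.Series([90, 80, 110, 70, 85, 95])

# A centered 3-month rolling mean smooths single-month spikes,
# making the underlying trend easier to read
smoothed = counts.rolling(window=3, center=True).mean()
```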
At this point, I import more demographic data, again from https://www.kaggle.com/datasets/kwullum/fatal-police-shootings-in-the-us. The following code blocks convert the original "police_killings" dataframe into a new dataframe called "state_info". Unlike the original, this new dataframe includes each state's victim count (from 2015 to 2017) and demographic information such as the percent of males, females, Hispanic people, White people, Black people, Native people, Asian people, people of other races, diploma holders, the poverty rate, and more. It's important to mention the column called "Percent Victims of Population": because states with larger populations would naturally have higher victim counts, I represented the victim count as a percentage of each state's population.
#Creating a dictionary which contains state names as the key and the abbreviation as the values
state_abbrev = { "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", "California": "CA", "Colorado": "CO", "Connecticut": "CT", "Delaware": "DE", "Florida": "FL", "Georgia": "GA", "Hawaii": "HI", "Idaho": "ID", "Illinois": "IL", "Indiana": "IN", "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", "Louisiana": "LA", "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI", "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT", "Nebraska": "NE", "Nevada": "NV", "New Hampshire": "NH", "New Jersey": "NJ", "New Mexico": "NM", "New York": "NY", "North Carolina": "NC", "North Dakota": "ND", "Ohio": "OH", "Oklahoma": "OK", "Oregon": "OR", "Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC", "South Dakota": "SD", "Tennessee": "TN", "Texas": "TX", "Utah": "UT", "Vermont": "VT", "Virginia": "VA", "Washington": "WA", "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY", "District of Columbia": "DC", "American Samoa": "AS", "Guam": "GU", "Northern Mariana Islands": "MP", "Puerto Rico": "PR", "United States Minor Outlying Islands": "UM", "U.S. Virgin Islands": "VI", }
#Reading in a new dataset, which contains the percent of males/females per state, into a new dataframe called "state_gender"
state_gender = pd.read_csv('state_gender.csv')
state_gender = state_gender.iloc[:, :-3]
state_gender = state_gender.iloc[:-1]
states = []
for index, row in state_gender.iterrows():
    states.append(state_abbrev[row['State']])
#Converting the entries which are objects with "%" at the end into floats where each percent is a whole number. For example, 48% would be written as 48
state_gender['State'] = states
state_gender['Male'] = state_gender['Male'].str[:-1]
state_gender['Male'] = state_gender['Male'].astype('float64')
state_gender['Female'] = state_gender['Female'].str[:-1]
state_gender['Female'] = state_gender['Female'].astype('float64')
#Reading in a new dataset, which contains the percent of different races per state, into a new dataframe called "state_info"
#This new "state_info" dataframe will be augmented to (or merged) with other demographic information
state_info = pd.read_csv('state_race.csv')
state_info = state_info.iloc[:-5]
hispanic = []
white = []
black = []
native = []
asian = []
other = []
states = []
#Converting the decimal representation of a percent into floats where each decimal is a whole number. For example, 0.48 becomes 48
for index, row in state_info.iterrows():
    states.append(state_abbrev[row['State']])
    hispanic.append((row['Hispanic'] / row['Total']) * 100)
    white.append((row['White'] / row['Total']) * 100)
    black.append((row['Black'] / row['Total']) * 100)
    native.append((row['Indian'] / row['Total']) * 100)
    asian.append((row['Asian'] / row['Total']) * 100)
    other.append(((row['Other'] + row['Hawaiian']) / row['Total']) * 100)
state_info.drop('Hawaiian', axis=1, inplace=True)
state_info.drop('Total', axis=1, inplace=True)
state_info.rename({'Indian': 'Native'}, axis=1, inplace=True)
state_info['State'] = states
state_info['White'] = white
state_info['Black'] = black
state_info['Hispanic'] = hispanic
state_info['Asian'] = asian
state_info['Native'] = native
state_info['Other'] = other
#Joining the "state_gender" dataframe into the "state_info" dataframe so now the "state_info" dataframe includes information on gender, in addition, to race
state_info = pd.merge(state_gender, state_info, left_on='State', right_on='State', how='inner')
#Creating a dictionary called "state_killing_counts" where the keys are the states and the values are the number of entries per state in the "police_killings" dataframe. This dictionary has been alphabetically sorted by the key
state_killing_counts = police_killings["State"].value_counts()
state_killing_counts = dict(state_killing_counts)
state_killing_counts = dict(sorted(state_killing_counts.items()))
#Creating a dictionary called "state_population" where the keys are states and the values are the population of each state. This data was found online. This dictionary has been alphabetically sorted by the key
state_population = {'CA': 39538223, 'TX': 29145505, 'FL': 21538187, 'NY': 20201249, 'PA': 13002700, 'IL': 12801989, 'OH': 11799448, 'GA': 10711908, 'NC': 10439388, 'MI': 10077331, 'NJ': 9288994, 'VA': 8631393, 'WA': 7705281, 'AZ': 7151502, 'MA': 7029917, 'TN': 6910840, 'IN': 6785528, 'MD': 6177224, 'MO': 6154913, 'WI': 5893718, 'CO': 5773714, 'MN': 5706494, 'SC': 5118425, 'AL': 5024279, 'LA': 4657757, 'KY': 4505836, 'OR': 4237256, 'OK': 3959353, 'CT': 3605944, 'UT': 3205958, 'IA': 3271616, 'NV': 3104614, 'AR': 3011524, 'MS': 2961279, 'KS': 2937880, 'NM': 2117522, 'NE': 1961504, 'ID': 1839106, 'WV': 1793716, 'HI': 1455271, 'NH': 1377529, 'ME': 1362359, 'RI': 1097379, 'MT': 1084225, 'DE': 989948, 'SD': 886667, 'ND': 779094, 'AK': 733391, 'DC': 689545, 'VT': 643077, 'WY': 576851}
state_population = dict(sorted(state_population.items()))
#Creating three different lists based on the keys/values of "state_killing_counts" and "state_population"
states = state_killing_counts.keys()
killings = state_killing_counts.values()
population = state_population.values()
#Reading in a new dataset, which contains the poverty rate per state, into a new dataframe called "state_poverty"
state_poverty = pd.read_csv('poverty_rates.csv')
state = []
for index, row in state_poverty.iterrows():
    state.append(state_abbrev[row[0]])
state_poverty['State'] = state
#Creating a dictionary called "state_poverty" where the keys are states and the values are the poverty rate of each state. This dictionary has been alphabetically sorted by the key
state_poverty = dict(zip(state_poverty['State'], state_poverty['Poverty Rate']))
state_poverty = dict(sorted(state_poverty.items()))
#Creating a list based on the values of "state_poverty"
poverty = state_poverty.values()
#Reading in a new dataset, which contains the percent of people >25 y/o with a high school diploma per state, into a new dataframe called "state_diploma"
state_diploma = pd.read_csv('state_diploma.csv')
state = []
for index, row in state_diploma.iterrows():
    state.append(state_abbrev[row[0]])
state_diploma['State'] = state
state_diploma = state_diploma.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5'])
#Creating a dictionary called "state_diploma" where the keys are states and the values are the percent of people >25 y/o with a high school diploma in each state. This dictionary has been alphabetically sorted by the key
state_diploma = dict(zip(state_diploma['State'], state_diploma['Percent Over 25 With a High School Diploma or higher']))
state_diploma = dict(sorted(state_diploma.items()))
state_diploma.pop('PR')
#Creating a list based on the values of "state_diploma"
diploma = state_diploma.values()
killing_col = []
population_col = []
proportion_col = []
poverty_col = []
diploma_col = []
#Iterating through the "state_info" dataframe and appending to new lists (which will be added to the dataframe) based on the state, using the different dictionaries that were defined above
for index, row in state_info.iterrows():
    killing_col.append(state_killing_counts[row['State']])
    population_col.append(state_population[row['State']])
    proportion_col.append((state_killing_counts[row['State']] / state_population[row['State']]) * 100)
    poverty_col.append(state_poverty[row['State']])
    diploma_col.append(state_diploma[row['State']])
state_info['Victim Count'] = killing_col
state_info['Population'] = population_col
state_info['Proportion of Victims'] = proportion_col
state_info['Poverty Rate'] = poverty_col
state_info['Percent of Diploma Holders'] = diploma_col
#Renaming the columns to be more meaningful to someone who is not familiar with the dataframe
state_info = state_info.rename(columns={'Male': 'Percent Male', 'Female': 'Percent Female', 'Hispanic': 'Percent Hispanic','White': 'Percent White',
'Black': 'Percent Black', 'Native':'Percent Native', 'Asian': 'Percent Asian',
'Other': 'Percent Other Races', 'Proportion of Victims': 'Percent Victims of Population',
'Poverty Rate':'Percent Poverty Rate', 'Percent of Diploma Holders': 'Percent Diploma Holders'})
state_info
| | State | Percent Male | Percent Female | Percent Hispanic | Percent White | Percent Black | Percent Native | Percent Asian | Percent Other Races | Victim Count | Population | Percent Victims of Population | Percent Poverty Rate | Percent Diploma Holders |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AL | 48.1 | 51.9 | 4.351991 | 2.262064 | 0.171034 | 0.070854 | 0.012098 | 1.835941 | 45 | 5024279 | 0.000896 | 15.98 | 87.93 |
| 1 | AK | 51.2 | 48.8 | 7.199419 | 3.661108 | 0.159297 | 0.468799 | 0.194575 | 2.715641 | 14 | 733391 | 0.001909 | 10.34 | 93.31 |
| 2 | AZ | 49.5 | 50.5 | 31.511985 | 19.637070 | 0.266683 | 0.539722 | 0.085614 | 10.982896 | 105 | 7151502 | 0.001468 | 14.12 | 88.97 |
| 3 | AR | 49.0 | 51.0 | 7.624126 | 3.813574 | 0.078191 | 0.107375 | 0.024603 | 3.600384 | 20 | 3011524 | 0.000664 | 16.08 | 88.67 |
| 4 | CA | 49.7 | 50.3 | 39.091445 | 19.540923 | 0.275990 | 0.457238 | 0.229576 | 18.587719 | 374 | 39538223 | 0.000946 | 12.58 | 84.45 |
| 5 | CO | 50.2 | 49.8 | 21.655972 | 14.016981 | 0.202325 | 0.428044 | 0.061003 | 6.947619 | 63 | 5773714 | 0.001091 | 9.78 | 92.43 |
| 6 | CT | 48.9 | 51.1 | 16.445986 | 8.179470 | 0.843708 | 0.117545 | 0.041310 | 7.263953 | 7 | 3605944 | 0.000194 | 9.78 | 91.11 |
| 7 | DE | 48.1 | 51.9 | 9.440114 | 5.926655 | 0.482495 | 0.074818 | 0.028832 | 2.927314 | 8 | 989948 | 0.000808 | 11.44 | 91.36 |
| 8 | DC | 47.5 | 52.5 | 11.108816 | 4.346885 | 0.850459 | 0.162399 | 0.059119 | 5.689954 | 11 | 689545 | 0.001595 | 15.45 | 92.79 |
| 9 | FL | 48.8 | 51.2 | 25.775772 | 18.232454 | 0.706761 | 0.078169 | 0.052750 | 6.705637 | 137 | 21538187 | 0.000636 | 13.34 | 89.79 |
| 10 | GA | 48.3 | 51.7 | 9.632952 | 5.153539 | 0.420888 | 0.166680 | 0.039271 | 3.852574 | 61 | 10711908 | 0.000569 | 14.28 | 88.97 |
| 11 | HI | 49.1 | 50.9 | 10.743525 | 2.574725 | 0.111755 | 0.099150 | 0.872771 | 7.085124 | 11 | 1455271 | 0.000756 | 9.26 | 92.93 |
| 12 | ID | 49.9 | 50.1 | 12.709256 | 7.036726 | 0.055747 | 0.237294 | 0.051187 | 5.328304 | 16 | 1839106 | 0.000870 | 11.94 | 91.26 |
| 13 | IL | 49.2 | 50.8 | 17.227648 | 8.938293 | 0.236502 | 0.162557 | 0.055913 | 7.834383 | 57 | 12801989 | 0.000445 | 11.99 | 90.17 |
| 14 | IN | 49.3 | 50.7 | 7.099934 | 3.836122 | 0.143440 | 0.054757 | 0.024325 | 3.041291 | 40 | 6785528 | 0.000589 | 12.91 | 90.64 |
| 15 | IA | 49.9 | 50.1 | 6.171629 | 4.078462 | 0.086444 | 0.078825 | 0.013429 | 1.914470 | 12 | 3271616 | 0.000367 | 11.11 | 93.32 |
| 16 | KS | 49.8 | 50.2 | 12.071678 | 7.628083 | 0.207374 | 0.154157 | 0.046659 | 4.035406 | 24 | 2937880 | 0.000817 | 11.44 | 91.89 |
| 17 | KY | 49.1 | 50.9 | 3.764025 | 2.164232 | 0.116138 | 0.028889 | 0.013783 | 1.440984 | 41 | 4505836 | 0.000910 | 16.61 | 87.99 |
| 18 | LA | 48.2 | 51.8 | 5.217407 | 2.921269 | 0.247866 | 0.055953 | 0.020452 | 1.971866 | 50 | 4657757 | 0.001073 | 18.65 | 86.68 |
| 19 | ME | 49.1 | 50.9 | 1.726027 | 1.037943 | 0.065184 | 0.065855 | 0.008428 | 0.548617 | 10 | 1362359 | 0.000734 | 11.07 | 94.53 |
| 20 | MD | 48.2 | 51.8 | 10.259301 | 4.083626 | 0.487377 | 0.076802 | 0.039784 | 5.571712 | 36 | 6177224 | 0.000583 | 9.02 | 91.09 |
| 21 | MA | 48.9 | 51.1 | 12.049173 | 5.772804 | 0.685014 | 0.073563 | 0.046748 | 5.471044 | 22 | 7029917 | 0.000313 | 9.85 | 91.10 |
| 22 | MI | 49.3 | 50.7 | 5.225665 | 3.080839 | 0.176029 | 0.071226 | 0.021135 | 1.876436 | 36 | 10077331 | 0.000357 | 13.71 | 91.96 |
| 23 | MN | 49.9 | 50.1 | 5.494034 | 2.672403 | 0.094229 | 0.109800 | 0.038249 | 2.579352 | 31 | 5706494 | 0.000543 | 9.33 | 94.13 |
| 24 | MS | 48.1 | 51.9 | 3.163891 | 1.623665 | 0.162853 | 0.035347 | 0.006070 | 1.335956 | 22 | 2961279 | 0.000743 | 19.58 | 86.49 |
| 25 | MO | 48.9 | 51.1 | 4.289192 | 2.467326 | 0.090347 | 0.059241 | 0.021293 | 1.650986 | 58 | 6154913 | 0.000942 | 13.01 | 91.59 |
| 26 | MT | 50.7 | 49.3 | 3.908901 | 2.201647 | 0.093058 | 0.264009 | 0.012904 | 1.337283 | 11 | 1084225 | 0.001015 | 12.78 | 94.35 |
| 27 | NE | 49.8 | 50.2 | 11.173152 | 6.971992 | 0.111860 | 0.181773 | 0.021052 | 3.886474 | 14 | 1961504 | 0.000714 | 10.37 | 92.16 |
| 28 | NV | 50.1 | 49.9 | 28.901544 | 13.889438 | 0.362277 | 0.388974 | 0.178531 | 14.082324 | 35 | 3104614 | 0.001127 | 12.78 | 87.16 |
| 29 | NH | 50.0 | 50.0 | 3.895387 | 2.403479 | 0.199374 | 0.022284 | 0.013503 | 1.256748 | 7 | 1377529 | 0.000508 | 7.42 | 94.44 |
| 30 | NJ | 49.0 | 51.0 | 20.427604 | 10.819300 | 0.771455 | 0.135030 | 0.070959 | 8.630860 | 30 | 9288994 | 0.000323 | 9.67 | 90.98 |
| 31 | NM | 49.0 | 51.0 | 49.202559 | 33.319170 | 0.262325 | 0.728891 | 0.112874 | 14.779299 | 41 | 2117522 | 0.001936 | 18.55 | 87.48 |
| 32 | NY | 48.7 | 51.3 | 19.066030 | 7.141987 | 1.357582 | 0.174769 | 0.086765 | 10.304927 | 43 | 20201249 | 0.000213 | 13.58 | 88.03 |
| 33 | NC | 48.3 | 51.7 | 9.541973 | 4.969572 | 0.336012 | 0.116009 | 0.027132 | 4.093248 | 67 | 10439388 | 0.000642 | 13.98 | 89.70 |
| 34 | ND | 51.3 | 48.7 | 3.988064 | 1.997123 | 0.065755 | 0.262364 | 0.019464 | 1.643359 | 4 | 779094 | 0.000513 | 10.53 | 93.62 |
| 35 | OH | 49.0 | 51.0 | 3.939428 | 2.171324 | 0.173383 | 0.047459 | 0.016496 | 1.530765 | 72 | 11799448 | 0.000610 | 13.62 | 91.74 |
| 36 | OK | 49.5 | 50.5 | 10.925035 | 6.245952 | 0.144885 | 0.379557 | 0.029752 | 4.124890 | 67 | 3959353 | 0.001692 | 15.27 | 88.71 |
| 37 | OR | 49.7 | 50.3 | 13.223976 | 7.676256 | 0.088570 | 0.235637 | 0.061776 | 5.161737 | 32 | 4237256 | 0.000755 | 12.36 | 91.87 |
| 38 | PA | 49.2 | 50.8 | 7.595324 | 3.672452 | 0.525624 | 0.067011 | 0.028097 | 3.302140 | 45 | 13002700 | 0.000346 | 11.95 | 91.89 |
| 39 | RI | 48.9 | 51.1 | 15.882711 | 7.553427 | 0.994141 | 0.142938 | 0.064284 | 7.127920 | 2 | 1097379 | 0.000182 | 11.58 | 89.14 |
| 40 | SC | 48.2 | 51.8 | 5.831209 | 3.068201 | 0.185289 | 0.056290 | 0.015182 | 2.506247 | 42 | 5118425 | 0.000821 | 14.68 | 89.61 |
| 41 | SD | 50.7 | 49.3 | 4.104006 | 2.263071 | 0.045489 | 0.322175 | 0.015694 | 1.457577 | 9 | 886667 | 0.001015 | 12.81 | 93.05 |
| 42 | TN | 48.7 | 51.3 | 5.569213 | 3.356468 | 0.150703 | 0.049673 | 0.016080 | 1.996288 | 56 | 6910840 | 0.000810 | 14.62 | 89.74 |
| 43 | TX | 49.5 | 50.5 | 39.441532 | 27.780769 | 0.338654 | 0.254192 | 0.065342 | 11.002575 | 203 | 29145505 | 0.000697 | 14.22 | 85.39 |
| 44 | UT | 50.5 | 49.5 | 14.155289 | 7.225380 | 0.097644 | 0.174154 | 0.035827 | 6.622284 | 22 | 3205958 | 0.000686 | 9.13 | 93.17 |
| 45 | VT | 49.6 | 50.4 | 2.004997 | 1.242752 | 0.074959 | 0.053657 | 0.013454 | 0.620175 | 3 | 643077 | 0.000467 | 10.78 | 94.55 |
| 46 | VA | 48.8 | 51.2 | 9.527981 | 5.101442 | 0.348734 | 0.066950 | 0.066891 | 3.943964 | 43 | 8631393 | 0.000498 | 10.01 | 91.38 |
| 47 | WA | 49.9 | 50.1 | 12.932133 | 6.069539 | 0.140100 | 0.214164 | 0.084180 | 6.424150 | 52 | 7705281 | 0.000675 | 10.19 | 92.35 |
| 48 | WV | 49.4 | 50.6 | 1.586732 | 0.972322 | 0.063737 | 0.004094 | 0.015492 | 0.531087 | 22 | 1793716 | 0.001227 | 17.10 | 88.82 |
| 49 | WI | 49.9 | 50.1 | 7.030631 | 3.685103 | 0.126796 | 0.083417 | 0.023868 | 3.111448 | 42 | 5893718 | 0.000713 | 10.97 | 93.33 |
| 50 | WY | 51.1 | 48.9 | 10.123712 | 6.782512 | 0.063129 | 0.261633 | 0.025114 | 2.991324 | 7 | 576851 | 0.001213 | 10.76 | 93.59 |
#Creating a heat map of the "state_info" which contains information on the correlations between the different columns
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(state_info.corr(), annot = True, linewidths=.5, ax=ax)
We can see a fairly strong linear relationship between the percent of victims of the entire population and percent male, percent Hispanic, percent White, percent Native, and the poverty rate. Correlation only measures linear relationships, though, and it is possible that other variables have a nonlinear relationship with the victim rate. That said, a nonlinear model should only be adopted when a statistical test supports it.
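One rank-based measure that captures monotone (possibly nonlinear) association is Spearman's correlation, which pandas exposes through the same `corr` interface; a minimal sketch on hypothetical columns:

```python
import pandas as pd

# Toy columns: y is a nonlinear but perfectly monotone function of x
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1, 4, 9, 16, 25]})

pearson = df['x'].corr(df['y'], method='pearson')
spearman = df['x'].corr(df['y'], method='spearman')
# Spearman is 1 because the ranks agree exactly, even though
# the relationship is quadratic rather than linear
```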
#The beginning of this code is the same as before. Python was having issues referencing these variables, so they were redefined
race_killings_counts = police_killings["Race"].value_counts()
race_killings_counts = dict(race_killings_counts)
race_population = {'White': (0.593 * 331893745), 'Black': (0.136 * 331893745), 'Hispanic': (0.189 * 331893745), 'Asian': (0.061 * 331893745), 'Native': (0.013 * 331893745), 'Other': 2655149.96}
races = race_killings_counts.keys()
killings = race_killings_counts.values()
population = race_population.values()
#Creating a plot for the number of police killings per race and the total population per race
fig, axs = plt.subplots(2, figsize=(10, 10))
axs[0].bar(races, killings)
axs[0].set_title("Number of Victims of Fatal Police Brutality, per Race")
axs[0].set(xlabel='Race', ylabel='Number of Victims')
axs[1].bar(races, population)
axs[1].set_title("Population, per Race")
axs[1].set(xlabel='Race', ylabel='Population')
plt.show()
#Creating a new dictionary from the "race_killings_counts" and "race_population" dictionaries defined earlier. The keys are the same races, and each value is that race's killing count divided by its population
race_killing_proportion = {}
for key in race_killings_counts:
    race_killing_proportion[key] = race_killings_counts[key] / race_population[key]
race = race_killing_proportion.keys()
proportions = race_killing_proportion.values()
#Creating a plot for the proportion of race population that were victims of fatal police brutality
plt.figure(figsize=(8, 8))
plt.bar(race, proportions)
plt.title("Proportion of Race Victims of Fatal Police Brutality")
plt.xlabel("Race")
plt.ylabel("Proportion of Race")
plt.show()
White people are the most common victims of police brutality as measured by raw counts. However, this may simply be because White people make up the largest share of the population. To account for this, I plotted the proportion of each race's population that were victims of police brutality and found that the proportion for Black people was far greater than for any other race, which suggests that police might have a racial bias.
#Plotting the number of entries in "police_killings" that were female/male using value_counts
gender_counts = police_killings["Gender"].value_counts()
plt.figure(figsize=(8, 8))
gender_counts.plot.pie(autopct="%.2f%%")
plt.title('Gender Distribution of Victims of Fatal Police Brutality')
plt.show()
Males are by far the most common victims of fatal police brutality; ~96% of the victims in my dataset were men. This suggests bias and stereotypes against men, especially men of color, a pattern known as gender-based policing.
#Plotting the number of entries in "police_killings" that displayed signs of mental illness using value_counts
mental_illness = police_killings["Has Signs of Mental Illness"].value_counts()
plt.figure(figsize=(8, 8))
mental_illness.plot.pie(autopct="%.2f%%")
plt.title('Signs of Mental Illness Distribution of Victims of Fatal Police Brutality')
plt.show()
~25% of victims of fatal police brutality displayed signs of mental illness. This could suggest that police officers and law enforcement are not well trained in recognizing and responding to people with mental illness. Additionally, the stigma and discrimination against people with mental illness may have played a part.
#Plotting the number of entries in "police_killings" where the police officer didn't have their body camera on using value_counts
body_cam = police_killings["Has Body Camera"].value_counts()
plt.figure(figsize=(8, 8))
body_cam.plot.pie(autopct="%.2f%%")
plt.title('Body Camera On Distribution')
plt.show()
~89% of the time that police officers were involved in fatal police brutality, their camera was off. Police are supposed to wear body cameras to promote accountability and professionalism. However, in many cases of fatal police brutality, officers did not have their cameras on, eliminating evidence and, with it, accountability.
#The beginning of this code is the same as before. Python was having issues referencing them, so they were redefined
states = state_killing_counts.keys()
killings = state_killing_counts.values()
population = state_population.values()
#Creating a plot for the number of police killings per state and the total population per state
fig, axs = plt.subplots(2, figsize=(15, 15))
axs[0].bar(states, killings)
axs[0].set_title("Number of Victims of Fatal Police Brutality, per State")
axs[0].set(xlabel='State', ylabel='Number of Victims')
axs[1].bar(states, population)
axs[1].set_title("Population, per State")
axs[1].set(xlabel='State', ylabel='Population')
States with the largest populations, such as California and Texas, also had the greatest numbers of victims, which makes sense: raw victim counts scale with population.
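Because raw counts track population, a per-capita rate makes cross-state comparison fairer. A minimal sketch, using made-up state figures rather than the notebook's real `state_killing_counts` and `state_population` dictionaries:

```python
# Hypothetical state-level figures (illustrative stand-ins, not the notebook's
# real state_killing_counts / state_population dictionaries)
killings_by_state = {"CA": 800, "TX": 600, "WY": 10}
population_by_state = {"CA": 39_500_000, "TX": 29_000_000, "WY": 580_000}

# Victims per 100,000 residents removes the effect of raw population size
per_capita = {
    s: killings_by_state[s] / population_by_state[s] * 100_000
    for s in killings_by_state
}
for state, rate in sorted(per_capita.items(), key=lambda kv: -kv[1]):
    print(f"{state}: {rate:.2f} victims per 100k residents")
```

Under this normalization, a small state can rank above a large one even with far fewer total victims.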
#The beginning of this code is the same as before. Python was having issues referencing it, so it was redefined
poverty = state_poverty.values()
#Creating a plot for the number of police killings per state and the poverty rate per state
fig, axs = plt.subplots(2, figsize=(15, 15))
axs[0].bar(states, killings)
axs[0].set_title("Number of Victims of Fatal Police Brutality, per State")
axs[0].set(xlabel='State', ylabel='Number of Victims')
axs[1].bar(states, poverty)
axs[1].set_title("Poverty Rate, per State")
axs[1].set(xlabel='State', ylabel='Poverty Rate (%)')
There is no significant or visible relationship between poverty rates and the number of victims of fatal police brutality; the states with the greatest and lowest numbers of victims showed no corresponding trend in poverty rates.
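An eyeballed "no visible relationship" can also be checked numerically with a correlation test. A sketch using made-up per-state values (not the real `killings` and `state_poverty` data):

```python
from scipy.stats import pearsonr

# Hypothetical per-state values (illustrative stand-ins only)
killings = [800, 600, 250, 120, 90, 60, 40, 30]
poverty_rates = [12.3, 13.6, 10.1, 14.4, 9.8, 11.2, 15.0, 10.5]

# Pearson's r near 0 (with a large p-value) would support the visual
# impression that the two variables are unrelated
r, p_value = pearsonr(killings, poverty_rates)
print(f"r = {r:.3f}, p = {p_value:.3f}")
```

On the real state-level data, this would give a single quantitative summary to back up (or challenge) what the paired bar charts suggest.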
#The beginning of this code is the same as before. Python was having issues referencing it, so it was redefined
diploma = state_diploma.values()
#Creating a plot for the number of police killings per state and the percent of people >25 y/o with a diploma per state
fig, axs = plt.subplots(2, figsize=(15, 15))
axs[0].bar(states, killings)
axs[0].set_title("Number of Victims of Fatal Police Brutality, per State")
axs[0].set(xlabel='State', ylabel='Number of Victims')
axs[1].bar(states, diploma)
axs[1].set_title("Percent of People >25 y/o with Diploma, per State")
axs[1].set(xlabel='State', ylabel='Percent of People >25 y/o with Diploma (%)')
axs[1].set_ylim([80, 100])
States such as California and Texas, which have among the lowest percentages of people over 25 with a diploma, also had the greatest numbers of victims. One possible explanation is that people with more advanced levels of education find it easier to advocate for themselves and seek justice.
state = state_killing_counts.keys()
num_entries = state_killing_counts.values()
#Making the state_killing_counts dictionary into a dataframe, which is very compatible with choropleth
state_occurence = pd.DataFrame(list(zip(state, num_entries)),columns=['State','Occurences'])
#Plotting the number of victims of fatal police brutality in different states using plotly express's choropleth
fig = px.choropleth(state_occurence, locations='State', locationmode="USA-states", scope="usa",color='Occurences',
color_continuous_scale="Viridis_r", )
fig.show()
Sun Belt states such as California, Arizona, Texas, and Florida have the highest raw counts of fatal police brutality. This might reflect these regions' policies, as well as biases and stereotypes shared by people in these regions, though these are also among the most populous states.
I wanted to better understand how demographic factors (percent male, percent female, percent Hispanic, percent white, percent black, percent Native, percent Asian, percent other races, poverty rate, and percent diploma holders) can be used to predict the severity of police brutality, measured as the percent of a state's population who were victims. The demographic factors became the model's features and the severity measure its target. A regression felt more fitting here because I wanted the features to predict a continuous variable. I used hold-out validation: the training set (75% of the data) was used to train the model, and the testing set (the remaining 25%) to evaluate it. I do wish I had a larger dataset, because working at the state level was very limiting.
#Defining the feature variables which will be used to predict the target variable
X = state_info.drop(columns=['State', 'Victim Count', 'Population', 'Percent Victims of Population'])
#Defining the target variable
y = state_info['Percent Victims of Population']
#Creating the testing and training sets, such that the testing set is 25% of the entire set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
#Running an OLS regression (note: sm.OLS does not add an intercept unless the features are passed through sm.add_constant)
model = sm.OLS(y_train, X_train)
results = model.fit()
y_pred = results.predict(X_test)
#Printing out the summary statistics of said OLS regression
print(results.summary())
OLS Regression Results
=========================================================================================
Dep. Variable: Percent Victims of Population R-squared: 0.755
Model: OLS Adj. R-squared: 0.687
Method: Least Squares F-statistic: 11.17
Date: Fri, 16 Dec 2022 Prob (F-statistic): 4.76e-07
Time: 10:24:40 Log-Likelihood: 272.29
No. Observations: 38 AIC: -526.6
Df Residuals: 29 BIC: -511.8
Df Model: 8
Covariance Type: nonrobust
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
Percent Male 0.0001 7.44e-05 1.511 0.142 -3.98e-05 0.000
Percent Female 0.0001 5.41e-05 2.192 0.037 7.93e-06 0.000
Percent Hispanic 0.0004 0.000 4.169 0.000 0.000 0.001
Percent White -0.0004 0.000 -3.819 0.001 -0.001 -0.000
Percent Black -0.0009 0.000 -4.817 0.000 -0.001 -0.001
Percent Native 0.0019 0.000 4.813 0.000 0.001 0.003
Percent Asian 0.0004 0.000 1.360 0.184 -0.000 0.001
Percent Other Races -0.0005 0.000 -4.529 0.000 -0.001 -0.000
Percent Poverty Rate -2.808e-05 3.93e-05 -0.714 0.481 -0.000 5.23e-05
Percent Diploma Holders -0.0001 5.17e-05 -2.208 0.035 -0.000 -8.42e-06
==============================================================================
Omnibus: 0.784 Durbin-Watson: 1.541
Prob(Omnibus): 0.676 Jarque-Bera (JB): 0.481
Skew: -0.275 Prob(JB): 0.786
Kurtosis: 2.965 Cond. No. 1.93e+17
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.4e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
There are some important insights in the OLS regression’s summary: 1) The R-squared value is 0.755, meaning about 75.5% of the variation in the target variable can be explained by the feature variables. 2) The overall p-value (Prob (F-statistic)) is 4.76e-07, which is less than the alpha of 0.05, so there is a statistically significant relationship between the feature variables and the target variable. 3) Some features are individually significant, including percent female, percent Hispanic, percent white, percent black, percent Native, percent other races, and percent diploma holders, because those variables have individual p-values below alpha. One caveat: note [2] of the summary warns of a very small eigenvalue, indicating strong multicollinearity (the gender percentages and the race percentages each sum to roughly 100), so the individual coefficients should be interpreted with caution. To reiterate, the feature variables of this model are percent male, percent female, percent Hispanic, percent white, percent black, percent Native, percent Asian, percent other races, poverty rate, and percent diploma holders, and the target variable is percent victims of population.
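The multicollinearity flagged in note [2] can be checked with variance inflation factors (VIFs). Here is a sketch on a hypothetical feature matrix that mimics the structure of this model, where percent male and percent female sum to roughly 100; the data is synthetic, not the real `state_info` table:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical feature matrix mimicking the near-collinearity in this model:
# Percent Male + Percent Female sum to ~100 for every state (synthetic data)
rng = np.random.default_rng(0)
male = rng.uniform(48, 52, size=40)
X = pd.DataFrame({
    "const": 1.0,  # intercept column so the VIF regressions include an intercept
    "Percent Male": male,
    "Percent Female": 100 - male + rng.normal(0, 0.1, size=40),
    "Percent Poverty Rate": rng.uniform(8, 18, size=40),
})

# The VIF for a column is 1 / (1 - R^2) from regressing it on the other columns;
# values far above 10 flag features that are near-linear combinations of others
vifs = {
    col: variance_inflation_factor(X.values, i)
    for i, col in enumerate(X.columns) if col != "const"
}
for col, v in vifs.items():
    print(f"{col}: VIF = {v:.1f}")
```

On the real feature matrix, very large VIFs on the gender and race percentages would confirm that dropping one column from each group (or otherwise breaking the sum-to-100 constraint) is needed before individual coefficients can be trusted.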